Abstract

In this paper we present the dialogue-understanding components of an
architecture for assisting multi-human conversations in
artifact-producing meetings: meetings in which tangible products such
as project planning charts are created. Novel aspects of our system
include multimodal ambiguity resolution, modular ontology-driven
artifact manipulation, and a meeting browser for use during and after
meetings.

We  describe  the  software  architecture  and 
demonstrate  the  system  using  an  example 
multimodal  dialogue. 

1  Introduction 

Recently, much attention has been focused on the domain of
multi-person meeting understanding. Meeting dialogue presents a wide
range of challenges including continuous multi-speaker automatic
speech recognition (ASR), 2D whiteboard gesture and handwriting
recognition, 3D body and eye tracking, and multimodal multi-human
dialogue management and understanding. A significant amount of
research has gone toward understanding the problems facing the
collection, organization, and visualization of meeting data (Moore,
2002; Waibel et al., 2001), and meeting corpora like the ICSI Meeting
Corpus (Janin et al., 2003) are being made available. Continuing
research in the multimodal meeting domain has since blossomed,
including ongoing work from projects such as AMI and M4, and efforts
from several institutions.

Previous work on automatic meeting understanding has mostly focused
on surface-level recognition, such as speech segmentation, for
obvious reasons: understanding free multi-human speech at any level
is an extremely difficult problem for which best performance is
currently poor. In addition, the primary focus for applications has
been on off-line tools such as post-meeting multimodal information
browsing.

In  parallel  to  such  efforts  we  are  applying 
dialogue-management  techniques  to  attempt  to 
understand  and  monitor  meeting  dialogues  as 
they  occur,  and  to  supplement  multimodal 
meeting  records  with  information  relating  to  the 
structure  and  purpose  of  the  meeting. 

Our efforts are focused on assisting artifact-producing meetings,
i.e. meetings for which the intended outcome is a tangible product
such as a project management plan or a budget. The
dialogue-understanding system helps to create and manipulate the
artifact, delivering a final product at the end of the meeting. At
the same time, the state of the artifact is used as part of the
dialogue context under which future utterances are interpreted,
serving a number of useful roles in the dialogue-understanding
process:

•  The dialogue manager employs generic dialogue moves with plugin
points to be defined by specific artifact types, e.g. project plan,
budget;

•  The artifact state helps resolve ambiguity by providing evidence
for multimodal fusion and constraining topic recognition;

•  The artifact type can be used to bias ASR language models;

•  The constructed artifact provides an interface for a meeting
browser that supports directed queries about discussion that took
place in the meeting, e.g. “Why did we decide on that date?”

In addition, we focus our attention on the handling of ambiguities
produced on many levels, including those produced during automatic
speech recognition, multimodal communication, and artifact
manipulation. The present dialogue manager uses several techniques
to do this, including the maintenance of multiple dialogue-move
hypotheses, fusion with multimodal gestures, and the incorporation
of artifact-specific plug-ins.

Figure 1: The meeting assistant architecture, highlighting the
dialogue-management components.

The software architecture we use for managing multi-human dialogue
is an enhancement of a dialogue-management toolkit previously used
at CSLI in a range of applications, including command-and-control of
autonomous systems (Lemon et al., 2002) and intelligent tutoring
(Clark et al., 2002). In this paper, we detail the
dialogue-management components (Section 3), which support a larger
project involving multiple collaborating institutions (Section 2) to
build a multimodal meeting-understanding system capable of
integrating speech, drawing and writing on a whiteboard, and
physical gesture recognition.

We also describe our toolkit for on-line and off-line meeting
browsing (Section 4), which allows a meeting participant, observer,
or developer to visually and interactively answer questions about
the history of a meeting, the processes performed to understand it,
and the causal relationships between dialogue and artifact
manipulation.

2  Meeting  Assistant  Architecture 

The complete meeting assistant is a highly collaborative effort from
several institutions. Its overall architecture, with a focus on our
contributions to the system, is illustrated in Figure 1.

The components for drawing and writing recognition and multimodal
integration (Kaiser et al., 2003) were developed at The Oregon
Graduate Institute (OGI) Center for Human-Computer Communication³;
the component for physical gesture recognition (Ko et al., 2003) was
developed at The Massachusetts Institute of Technology (MIT) AI Lab.
Integration between all components was performed by project members
at those sites and at SRI International, and integration between our
CSLI Conversational Intelligence Architecture and OGI’s Multimodal
Integrator (MI) was performed by members of both teams. ASR is done
using CMU Sphinx, from which the n-best list of results is passed to
SRI’s Gemini parser (Dowding et al., 1993). Gemini incorporates a
suite of techniques for handling noisy input, including fragment
detection, and its dynamic grammar capabilities are used to register
new lexical items, such as names of tasks that may be
out-of-grammar.

An example of a multimodal meeting conversation that the meeting
assistant currently supports can be found in Figure 2. There are two
meeting participants in a conference room with an electronic
whiteboard which can record their pen strokes and a video camera
that tracks their body movements; A is standing at the whiteboard
and drawing while B is sitting at the table. A gloss of how the
system behaves in response to each utterance and gesture follows
each utterance; these glosses will be explained in greater detail
throughout the rest of the paper. The drawing made on the whiteboard
is in Figure 3(a), and the chart artifact as it was constructed by
the system is displayed in Figure 3(b).

3  Conversational  Intelligence  Architecture

To meet the challenges presented by multi-person meeting dialogue,
we have extended and enhanced our previously used Conversational
Intelligence Architecture (CIA). The CIA is a modular and highly
configurable multi-application system: a separation is made between
generic dialogue processes and those specific to a particular
domain. Creating a new application may involve writing new dialogue
moves and configuring the CIA to use these. We have successfully
used this “toolkit” approach in our previous applications at CSLI to
interface novel devices without modifying the core dialogue manager.

³ http://www.cse.ogi.edu/CHCC/

A: So, let’s uh figure out what uh needs uh needs to be done. Let’s
   look at the schedule. [draws chart axes]
   Gloss: utterance and gesture information fused; a new milestone
   chart artifact is created

B: So, if all goes well, we’ve got funding for five years.
   Gloss: system sets unit on axis to “years”

A: Yeah. Let’s see one, two ... [draws five tick marks on the x-axis]
   Gloss: system assumes tick marks are years

B: Well, the way I see it, uh we’ve got three tasks.
   Gloss: dialogue manager hypothesizes three tasks should be added,
   waits for multimodal confirmation

A: Yeah right [draws three task lines horizontally on the axis]
   Gloss: multimodal confirmation is given; information about task
   start and end dates is fused from the drawing

A: Let’s call this task line demo [touches the top line with the
   pen], call this task line signoff [touches the middle line with
   the pen], and call this task line system [touches the bottom line
   with the pen].
   Gloss: each utterance causes the dialogue manager to generate
   three distinct hypotheses, in each hypothesis a different task is
   named; the gestures disambiguate these in the multimodal
   integrator

B: So we have two demos to get done.

A: uh huh

B: Darpatech is at the end of month fifteen [A draws a diamond at
   month fifteen on the demo task line]
   Gloss: dialogue manager hypothesizes a milestone called
   “darpatech” at month fifteen; gesture confirms this and pinpoints
   appropriate task line

B: And the final demonstrations are at the end of year five [A draws
   a diamond at year five on the demo task line]
   Gloss: same processing as previous

A: Hmm, so when do the signoffs need to happen do you think?
   Gloss: dialogue manager expects next utterance to be an answer

B: Six months before the demos [A draws two diamonds on the signoff
   task line, each one about 6 months before the demo milestones
   drawn above]
   Gloss: answer arrives; dialogue manager hypothesizes two new
   milestones which are confirmed by gesture

A: And we’ll need the systems by then too [A draws two diamonds on
   the system task line]
   Gloss: dialogue manager hypothesizes two more milestones,
   confirmed by gesture

B: That’s a bit aggressive I think. Let’s move the system milestone
   back six months. [B points finger at rightmost system milestone.
   A crosses it out and draws another one six months earlier]
   Gloss: dialogue manager hypothesizes a move of the milestone; 3D
   gesture and drawing confirm this

Figure 2: Example conversation understood by the system.

(a) The whiteboard input captured by OGI’s Charter gesture
recognizer

(b) The artifact as maintained in the dialogue system

Figure 3: Ink-captured vs ‘idealized’ artifact output.

The present application is, however, very different from our
previous applications, and from those commonly encountered in the
literature, which typically involve a single human user interacting
with a dialogue-enabled artificial agent. In the meeting
environment, the dialogue manager should at most very rarely
interpose itself into the discussion, since doing so would disrupt
the interaction between the human participants. This requirement
prohibits ambiguity and uncertainty from being resolved with, say, a
clarification question, which is the usual strategy in
conversational interfaces. Instead, uncertainty must be maintained
in the system until it can be resolved by leveraging context, by
using evidence from another modality, or by a future utterance.

The  meeting-understanding  domain  has  thus 
prompted  several  extensions  to  our  existing 
CIA,  many  of  which  we  expect  will  be  applied 
in  other  conversational  domains.  These  include: 

•  Support for handling multiple competing speech parses
(Section 3.2);

•  A generic artifact ontology which enables designing generically
useful artifact-savvy dialogue applications (Section 3.3);

•  Support for the generation and subsequent confirmation of
dialogue-move hypotheses in a multimodal integration framework
(Section 3.4);

•  The acceptance of non-verbal unimodal gestures into the
dialogue-move repertoire (Section 3.5);

•  A preliminary mechanism for supporting uncertainty across
multiple conversational moves (Section 3.6).

Before discussing these new features in detail, the following
section introduces the CIA and its persisting core
dialogue-management components.

3.1  Core  Components:  Information 
State  and  Context 

The core dialogue management components of the CIA maintain dialogue
context using the information-state and dialogue-move approach
(Larsson and Traum, 2000), where each contributed utterance modifies
the current context, or information state, of the dialogue. Each new
utterance is then interpreted within the current context (see Lemon
et al. (2002) for a detailed description).


A number of data structures are employed in this process. The
central dialogue state-maintaining structure is the Dialogue Move
Tree (DMT). The DMT represents the historical context of a dialogue.
An incoming utterance, classified by dialogue move, is interpreted
in context by attaching itself to an appropriate active node on the
DMT; e.g., an answer attaches to an active corresponding question
node. Currently, active nodes are kept on an Active Node List, which
is ordered so that those most likely to be relevant to the current
conversation are at the front of the list. Incoming utterances are
presented to each node in turn, and attach to the first appropriate
node (determined by information-state-update functions). Other
structures include the context-specific Salience List, which
maintains recently used terms for performing anaphora resolution,
and a Knowledge Base containing application-specific information,
which may be leveraged to interpret incoming utterances.⁸
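The attachment procedure described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names `Node`, `accepts`, and `attach` are hypothetical, and the `accepts` predicate stands in for the information-state-update functions, whose actual form is not given here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """A DMT node. `accepts` is a stand-in for an information-state-update check."""
    move_type: str
    accepts: Callable[[str], bool] = lambda move_type: False
    children: List["Node"] = field(default_factory=list)


def attach(active_nodes: List[Node], move_type: str) -> Optional[Node]:
    """Present the move to each active node in list order; attach to the first taker."""
    for node in active_nodes:
        if node.accepts(move_type):
            node.children.append(Node(move_type))
            return node
    return None  # no active node found the move appropriate


question = Node("wh-question", accepts=lambda m: m == "answer")
statement = Node("statement", accepts=lambda m: m in ("answer", "acknowledgement"))
active_node_list = [question, statement]  # most relevant node first

parent = attach(active_node_list, "answer")
```

Both nodes would accept an answer here, but the question node wins because it is nearer the front of the list, which is the effect the ordering of the Active Node List is meant to have.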

We  now  present  the  various  enhancements 
made  to  the  CIA  for  use  in  the  meeting  domain. 

3.2  ASR  and  Robust  Parsing 

The first step in understanding any dialogue is recognizing and
interpreting spoken utterances. In the meeting domain, we are
presented with the particularly difficult task of doing this for
spontaneous human-human speech. We therefore chose to perform ASR
using a statistical language model (LM) and employ CMU’s Sphinx to
generate an n-best list of recognition results. The recognition
engine uses a trigram LM trained on the complete set of possible
utterances expected given a small hand-crafted scenario like that in
the example dialogue. Despite the task’s limited domain, the
realized speech is very disfluent, generating an extremely broad
range of possible utterances that the system must handle. The
resulting n-best list is therefore often extremely varied.

To handle the ASR results of disfluent utterances, we employ SRI’s
Gemini robust language parser (Dowding et al., 1993). In particular,
we use Gemini to retrieve the longest strings of valid S and NP
fragments in each ASR result. Currently, we reject all but the
parsed S fragments, along with NP fragments when these are expected
by the system (e.g. an answer to a question containing an NP gap).
The parser uses generic syntactic rules, but is constrained
semantically by sorts specific to the domain. In Section 3.4, we
describe how the dialogue manager handles the multiple parses for a
single utterance and how it uses the uncertainty they represent.

⁸ Command-and-control applications have also made use of an Activity
Tree, which represents activities being carried out by the
dialogue-enabled device (Gruenstein, 2002); however, the present
application makes no use of it.

3.3  Artifact  Knowledge  Base  and 
Ontology 

In the present version of the CIA, all static domain knowledge about
meeting artifacts is defined in a modularized class-based ontology.
In conjunction with the ontology, we also maintain a dynamic
knowledge base (KB) which holds the current state of any artifacts.
This is stored as a collection of instances of the ontological
classes, and both components are maintained together using the
Protégé-2000⁹ ontology and knowledge-base toolkit (Grosso et al.,
1999).

The principal base classes in the artifact ontology are designed to
be both architecturally elegant and intuitive. To this end, we
characterize the world of artifacts as being made up of three
essential classes: entities which represent the tangible objects
themselves, relations which represent how the entities relate to one
another, and events which change the state of entities or relations.
Events are the most important tool aiding the dialogue management
algorithm. They comprehensively characterize the set of actions
which can change the current state of an artifact. They may be
classified into three categories: insert changes which insert a new
entity or relation instance into the KB, remove changes which remove
an instance, and value changes which modify the value of a slot in
an instance. All changes to the KB can be characterized as one of
these three atomic events or a combination of them.
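To make the three atomic events concrete, the following minimal sketch models the KB as a dictionary of instances, each a dictionary of slots. The representation, the function names, and the slot names are our own assumptions for illustration; the actual system stores instances through the Protégé-2000 toolkit.

```python
from typing import Any, Dict

KB = Dict[str, Dict[str, Any]]  # instance id -> slot name -> slot value


def insert_change(kb: KB, instance_id: str, slots: Dict[str, Any]) -> None:
    """Insert a new entity or relation instance into the KB."""
    kb[instance_id] = dict(slots)


def remove_change(kb: KB, instance_id: str) -> None:
    """Remove an existing instance from the KB."""
    del kb[instance_id]


def value_change(kb: KB, instance_id: str, slot: str, value: Any) -> None:
    """Modify the value of one slot in an existing instance."""
    kb[instance_id][slot] = value


kb: KB = {}
# Illustrative values only: a system milestone drawn at month 54 ...
insert_change(kb, "milestone1", {"task": "system", "month": 54})
# ... is later moved back six months, which is a single value change.
value_change(kb, "milestone1", "month", 48)
# An instance inserted by mistake can be taken out again.
insert_change(kb, "milestone2", {"task": "demo", "month": 15})
remove_change(kb, "milestone2")
```

A compound operation such as crossing out one milestone and drawing another could equally be expressed as a remove change followed by an insert change.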

3.4  Hypothesizers:  A  plugin 
architecture  for  artifact-driven 
multimodal  integration 

Ambiguities and uncertainties are both rampant in multimodal meeting
dialogues, and in artifact-producing meetings, the majority pertain
to artifacts and the utterances performed to change them. In this
section we explain how the CIA’s dialogue manager uses the artifact
ontology, and the repertoire of event classes in it, to formulate
sets of artifact-changing dialogue-move hypotheses from single
utterances. We also demonstrate how it uses the current state of the
artifact in the KB to constrain the interpretation of utterances in
context, and how multimodal gestures help to resolve ambiguous
interpretations.

⁹ http://protege.stanford.edu/

To begin, each dialogue-move hypothesis consists of the following
elements: (1) the DMT node associated with this hypothesis, (2) the
parse that gave rise to the hypothesis, (3) the probability of the
hypothesis, (4) an isUnimodal flag indicating whether or not the
dialogue move requires confirmation from other modalities, (5) a
list of artifact-change events to be made to the KB, and (6) the
information state update function to be invoked if this hypothesis
is confirmed by the multimodal integrator. Each of these elements
participates in the generation and confirmation process as detailed
below.
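The six elements listed above map naturally onto a record type. The sketch below is one possible rendering, not the system's actual class: the field names, the string encodings of the node and the parse, and the reading of the flag (True meaning that no cross-modal confirmation is needed) are all assumptions on our part.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class DialogueMoveHypothesis:
    dmt_node: str                            # (1) DMT node the move would attach to
    parse: str                               # (2) parse that gave rise to the hypothesis
    probability: float                       # (3) probability of the hypothesis
    is_unimodal: bool                        # (4) assumed: True = needs no other modality
    change_events: List[Dict[str, Any]]      # (5) artifact-change events for the KB
    update_fn: Callable[[Dict[str, Any]], None]  # (6) run only once confirmed


def add_darpatech(state: Dict[str, Any]) -> None:
    """Hypothetical information-state update for the example utterance."""
    state["milestones"].append(("darpatech", 15))


hyp = DialogueMoveHypothesis(
    dmt_node="task-line:demo",
    parse="milestone(darpatech, month(15))",
    probability=0.7,  # illustrative value
    is_unimodal=False,
    change_events=[{"kind": "insert", "cls": "Milestone", "month": 15}],
    update_fn=add_darpatech,
)

state: Dict[str, Any] = {"milestones": []}
confirmed_by_integrator = True  # stand-in for the multimodal integrator's verdict
if confirmed_by_integrator:
    hyp.update_fn(state)
```

Keeping the update function inside the hypothesis means nothing touches the information state until confirmation arrives, which matches the deferred-update behaviour described in this section.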

First, consider the utterance Darpatech is at the end of month
fifteen, from the example dialogue. This utterance is much more
likely to indicate the creation of a new milestone if a task line is
pertinent to the current dialogue context, e.g. the user has just
created a new task line. In our system, the ambiguous or uncertain
utterance, the current dialogue context, and the current state of
the chart are delegated to artifact-type specific components called
hypothesizers. Hypothesizers take the above as input and, using the
set of events available to their corresponding artifact in the
ontology, programmatically generate a list of dialogue-move
hypotheses appropriate in the given context. Alternatively, they can
return the empty list to indicate that there is no reasonable
interpretation of the utterance given the current context.
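A hypothesizer can be pictured as a plugin with a single method. The following sketch assumes a dictionary encoding for parses, context, and artifact state, and an arbitrary weighting that favours the most salient task line; none of these details are taken from the implemented system.

```python
from typing import Any, Dict, List, Protocol


class Hypothesizer(Protocol):
    """Assumed plugin interface: one method from inputs to hypotheses."""

    def hypothesize(self, parse: Dict[str, Any], context: Dict[str, Any],
                    artifact: Dict[str, Any]) -> List[Dict[str, Any]]: ...


class ChartHypothesizer:
    """Hypothetical hypothesizer for milestone charts."""

    def hypothesize(self, parse, context, artifact):
        if parse.get("type") != "milestone":
            return []
        task_lines = artifact.get("task_lines", [])
        if not task_lines:
            return []  # no task line exists, so no reasonable interpretation
        salient = context.get("salient_task")
        hypotheses = []
        for task in task_lines:
            # Assumed weighting: the salient task line is twice as likely.
            weight = 2.0 if task == salient else 1.0
            hypotheses.append({"event": "insert", "cls": "Milestone",
                               "task": task, "name": parse["name"],
                               "month": parse["month"], "weight": weight})
        total = sum(h["weight"] for h in hypotheses)
        for h in hypotheses:
            h["probability"] = h.pop("weight") / total
        return hypotheses


parse = {"type": "milestone", "name": "darpatech", "month": 15}
context = {"salient_task": "demo"}
artifact = {"task_lines": ["demo", "signoff", "system"]}
hyps = ChartHypothesizer().hypothesize(parse, context, artifact)
```

With three task lines on the chart, the utterance yields three competing insert hypotheses, and with no task lines it yields the empty list, mirroring the two outcomes described above.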

Hypothesizers work directly with the DMT architecture: as an
incoming utterance is sequentially presented to each active node in
the DMT, the dialogue context and the proposed active node are
passed into a hypothesizer corresponding to the particular artifact
associated with that node. If the hypothesizer can create one or
more valid hypotheses, then the utterance is attached to the DMT as
a child of that active node.¹⁰

¹⁰ There are, in fact, other rules as well which allow for
attachment. For example, questions, which don’t immediately generate
hypotheses, can also be attached to various nodes depending on the
dialogue context. While the emphasis here is on hypothesizers, these
are just one part of the dialogue processing toolkit.

In a multimodal domain, some hypotheses require confirmation in
other modalities before the dialogue manager can confidently update
the information state. In this particular system, in fact, the
dialogue manager does not directly update the KB’s current artifact
state; rather, it generates a set of dialogue-move hypotheses and
assigns each a confidence derived from ASR confidence, the
fragmentedness of the parse, and confidence in the proposed
attachment to a conversational thread. Each conversational move is
then provided a Hypothesis Repository for storing the hypotheses
associated with it. When dialogue processing is completed for a
particular conversational move, i.e. when all possible attachments
of all possible parses on the n-best list have been made, the set of
hypotheses is sent to the Multimodal Integrator (MI) for potential
fusion with gesture. Depending on the information from other
modalities, the MI confirms or rejects the hypotheses; moreover, a
confirmed hypothesis might be augmented with information provided by
other modalities. Such an augmentation occurs for the utterance We
have three tasks from the example dialogue. In this situation, the
dialogue manager hypothesizes that the user may be creating three
new task lines on the chart. When the user actually draws the three
task lines, the MI infers the start and stop dates based on where
the lines start and stop on the axis. In this case, it not only
confirms the dialogue manager’s hypothesis, but augments it to
reflect the additional date information obtained from the whiteboard
input.
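The confirm-and-augment step can be illustrated with a toy integrator. How the three confidence sources are actually combined is not specified above, so the plain product used here is an assumption, as are the function names, the dictionary fields, and the matching rule; the real Multimodal Integrator is a separate OGI component.

```python
from typing import Any, Dict, List, Optional


def confidence(asr: float, parse_coverage: float, attachment: float) -> float:
    """Assumed combination rule: a plain product of the three factors."""
    return asr * parse_coverage * attachment


def integrate(hypotheses: List[Dict[str, Any]],
              gesture: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Toy integrator: confirm the best hypothesis compatible with the gesture."""
    if gesture is None:
        return None  # nothing to fuse with yet; hypotheses stay unresolved
    compatible = [h for h in hypotheses
                  if h["kind"] == gesture["kind"] and h["count"] == gesture["count"]]
    if not compatible:
        return None
    best = max(compatible, key=lambda h: h["confidence"])
    # Augmentation: copy the drawn start and end positions into the hypothesis.
    return {**best, "spans": gesture["spans"]}


# "We've got three tasks": two competing readings of the same utterance.
hypotheses = [
    {"kind": "task_line", "count": 3, "confidence": confidence(0.8, 0.9, 0.5)},
    {"kind": "milestone", "count": 3, "confidence": confidence(0.8, 0.9, 0.25)},
]
# Three horizontal lines are then drawn; the span values are illustrative months.
gesture = {"kind": "task_line", "count": 3, "spans": [(0, 60), (0, 54), (0, 48)]}
confirmed = integrate(hypotheses, gesture)
```

The confirmed hypothesis leaves the integrator carrying the date spans that only the drawing could supply, which is the augmentation described for the three-task example.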

3.5  Unimodal  Gestures 

In addition to the Information State updates based on both speech
and gesture, multimodal meeting dialogue can often include gestures
in which a participant makes a change to an artifact using a
unimodal gesture not associated with an utterance. For example, a
user may draw a diamond on a task line but say nothing. Even in the
absence of speech, this can be unambiguously understood as the
creation of a milestone at a particular point on the line. These
unimodally produced changes to the chart must be noted by the
dialogue manager, as they are potential targets for later
conversation. To accommodate this, we introduce a new DMT node of
type Unimodal Gesture, thus implicitly including gesture as a
communicative act that can stand on its own in a conversation.

3.6  Uncertain  DMT  Node  Attachment 

Since hypotheses are not always immediately confirmed, uncertainty
must be maintained across multiple dialogue moves. The system
accomplishes this by extending the CIA to maintain multiple
competing Information States. In particular, the DMT has been
extended to allow the same parse to attach in multiple locations.
These multiple attachments are eventually pruned as more evidence is
accumulated in the form of further speech or gestures, that is, as
hypotheses are confirmed or rejected over time.

Figure 4: A snapshot from the meeting browser.

4  Meeting  Viewer  Toolkit 

Throughout an artifact-producing meeting, the dialogue system
processes a complex chronological sequence of events and information
states that form structures rich in information useful to dialogue
researchers and the dialogue participants themselves. To harness the
power of this information, we have constructed a toolkit for
visualizing and investigating the meeting information state and its
history.

Central to the toolkit is our meeting history browser, which can be
seen in Figure 4, displaying a portion of the example dialogue, with
the results of a search for “demo” highlighted. This record of the
meeting is available both during the meeting and afterwards to
assist users in answering questions they might have about the
meeting. Many kinds of questions can be answered in the browser,
like those a manager might ask the day after a meeting: “Why did we
move the deadline on that task 6 months later?”, “Did I approve
setting that deadline so early?”, and “What were we thinking when we
put that milestone at month fifteen?”. A meeting participant might
have questions as the meeting occurs, like “What did the chart look
like 5 minutes ago?”, “What did we say to make the system move that
milestone?”, and “What did Mr. Smith say at the beginning of the
meeting?”.

To help answer these questions, the browser performs many of the
functions found in current multimodal meeting browsers. For example,
it provides concise display of a meeting transcription, advanced
searching capabilities, saving and loading of meeting sessions, and
personalization of its own display characteristics. As a novel
addition to these basic behaviors, the browser is also designed to
display artifacts and the causal relationships between artifacts and
the utterances that cause them to change.

To effectively convey this information, the record of components
monitored by the history toolkit is presented to the user through a
window which chronologically displays the visual embodiment of those
components. Recognized utterances are shown as text, parses are
shown as grouped string fragments, and artifacts and their
sub-components are shown in their prototypical graphical form. The
window organizes these visual representations of the meeting’s
events and states into chronological tracks, each of which monitors
a unified conceptual part of the meeting. The user is then able to
link the elements causally.

Beyond the history browser, the toolkit also displays the current
state of all artifacts in an artifact-state window (e.g. Figure
3(b)). In the window, the user not only confirms the state of the
artifact but can also gain insight into the currently interpreted
dialogue context by monitoring how the artifact is highlighted. In
the figure, the third task is highlighted because it is the most
recently talked-about task. A meeting participant can therefore see
that subsequent anaphoric references to an unknown task will be
resolved to the third one.

Another GUI component of the toolkit is a small hypothesis window
which shows the current set of unresolved artifact-changing
hypotheses. It does this by displaying an artifact for each
hypothesis, reflecting the artifact’s future state given
confirmation of the hypothesis. The hypothesis’ probability and
associated parse are displayed under the artifact. The user may even
directly click a hypothesis to confirm it. The hypothesized future
states are, however, not displayed in the artifact-state window or
artifact-history browser, which show only the results of confirmed
actions.

In addition to being a GUI front-end, the toolkit maintains a fully
generic architecture for recording the history of any object in the
system software. These objects can be anything from the utterances
of a participant, to the state history of an artifact component, or
the record of hypotheses formulated by the dialogue manager. This
generic functionality gives the toolkit the ability to answer a wide
variety of questions for the user about any aspect of the dialogue
context history.
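A generic history of this kind can be approximated by timestamped snapshots keyed by object. The class below is a minimal sketch under our own assumptions (the name `HistoryRecorder`, string keys, and timestamps in seconds); it is meant only to show how a question such as “What did the chart look like 5 minutes ago?” reduces to a lookup.

```python
import bisect
from collections import defaultdict
from typing import Any, Dict, List, Optional


class HistoryRecorder:
    """Hypothetical recorder keeping timestamped snapshots for any object."""

    def __init__(self) -> None:
        self._times: Dict[str, List[float]] = defaultdict(list)
        self._states: Dict[str, List[Any]] = defaultdict(list)

    def record(self, key: str, time: float, state: Any) -> None:
        """Append a snapshot; times are assumed to be non-decreasing per key."""
        self._times[key].append(time)
        self._states[key].append(state)

    def state_at(self, key: str, time: float) -> Optional[Any]:
        """Return the latest snapshot recorded at or before `time`, if any."""
        i = bisect.bisect_right(self._times[key], time)
        return self._states[key][i - 1] if i else None


recorder = HistoryRecorder()
recorder.record("chart", 0, {"milestones": 0})
recorder.record("chart", 300, {"milestones": 2})
recorder.record("chart", 600, {"milestones": 3})

now = 650
five_minutes_ago = recorder.state_at("chart", now - 300)
```

Because the recorder knows nothing about what it stores, the same lookup serves utterances, artifact components, and hypothesis records alike.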

5  Future  Work 

Work is currently proceeding in a number of directions. Firstly, we
plan to incorporate further techniques for robust language
understanding, including word-spotting and other topic-recognition
techniques, within the context of the constructed artifact. We also
plan to investigate using the current state of the artifact to
further bias the ASR language model. We also plan on generalizing
the uncertainty management within the dialogue manager, allowing
multiple competing hypotheses to be supported over multiple dialogue
moves. Topic and other ambiguity management techniques will be used
to statistically filter and bias hypotheses, based on artifact
state.

We are currently expanding the meeting browser to categorize
utterances by dialogue act, and to recognize and categorize
aggregations as multi-move strategies, such as negotiations. This
will allow at-a-glance detection of where disagreements took place,
and where issues may have been left unresolved. A longer-term aim of
the project is to provide further support to the participants in the
meeting, e.g. by detecting opportunities to provide useful
information (e.g. schedules, when discussing who to allocate to a
task; documents pertinent to a topic under discussion) to meeting
participants automatically. Evaluation criteria are currently being
designed that include both standard measures, such as word error
rate, and measures involving recognition of meeting-level phenomena,
such as detecting agreement on action-items. Evaluation will be
performed using both corpus-based approaches (e.g. for evaluating
recognition of meeting phenomena) and real (controlled) meetings
with human subjects.